Abstract— Machine learning and Artificial Intelligence have significantly advanced in recent years owing 
to their potential to considerably increase the quality of life while reducing human workload. The paper 
demonstrates how AI and ML are used in the drug development process to shorten and enhance the overall 
timeline. It contains pertinent information on a variety of Machine Learning approaches and algorithms 
that are used across the whole drug development process to speed up research, save expenses, and reduce 
risks related to clinical trials. A range of QSAR analysis, hit finding, and de novo drug design applications 
are used in the pharmaceutical industry to enhance decision-making. As technologies like high-throughput 
screening and computational analysis of databases used for lead and target identification and development 
create and integrate vast volumes of data, machine learning and deep learning have grown in importance. 
It has also been emphasized how these cognitive models and tools may be used in lead creation, 
optimization, and thorough virtual screening. In this paper, problem statements and the corresponding 
state-of-the-art models have been considered for target validation, prognostic biomarkers, and digital 
pathology. Machine Learning models play a vital role in the various operations related to clinical trials 
embracing protocol optimization, participant management, data analysis and storage, clinical trial data 
verification, and surveillance. Post-development drug monitoring and unique industrially prevalent ML 
applications of pharmacovigilance have also been discussed. As a result, the goal of this study is to 
investigate the machine learning and deep learning algorithms utilised across the drug development 
lifecycle as well as the supporting techniques that have the potential to be useful. 


Keywords—Machine Learning, Artificial Intelligence, Drug Discovery, Drug Development, 
Pharmacovigilance 


I. INTRODUCTION 


Over the last ten years, machine learning (ML) has become 
increasingly popular in medicine. Machine learning has 
been explored since the middle of the 20th century, but 
recent advances in computing power, data accessibility, 
cutting-edge techniques, and a broad variety of technical 
expertise have expedited its use in healthcare. 


Machine learning methods have been used by drug 
companies since 1962. These techniques make it easier to 
gather pertinent characteristics, which advances our 
knowledge of complex biological systems. The 
pharmaceutical industry is increasingly using many 
prediction models to enhance the drug development 
process. Thanks to the algorithms used by various 
computational methodologies, we can now answer questions 
that present a greater challenge to chemists. 
They aid chemists in accurately modeling, analyzing, and 
forecasting a variety of biological responses with regard to 
drug design. With the help of the annotated data, machine 
learning algorithms learn intricate patterns to predict the 
annotations of new test data sets [1]. Genome association, 
protein function prediction, and other tasks involve the 
application of machine learning. It helps in comprehending 
a diverse array of drug features such as solubility, binding, 
and target-related assays. Despite the positive results, it is 
never easy to apply machine learning to the complex 
problem of drug development. Drug development, in 
contrast to other areas, has unique challenges in choosing 
an appropriate representation for the targets in a 
medication, such as the molecules and their complexes that 
are important to the drug's intended effect. The lack of 
bio-activity descriptions is one of the biggest problems. It 
is crucial to consider how to use the data at hand to 
accomplish the desired result. 


Therefore, determining the correct representation is always 
the most difficult task. The training data is particularly 
important for machine learning techniques. This is made 
much more difficult by the fact that the data used to make 
the majority of the forecasts is often inconsistent, noisy, 
and imprecise. It may become much more challenging due 
to the scarce and uneven data produced by the chemical 
experiments that were conducted. Recently, computational 
approaches to deal with these challenges have been 
developed. Drug discovery and development may be sped up 
in different ways by applying machine learning more 
widely to bioactivity data. 



Due to the significant time and financial commitment, the 
process of developing new drugs is exceedingly 
challenging. Finding a drug to combat a target often takes 
many years and billions of dollars. Even then, despite a 
great deal of effort, the success rate is extremely low. 
There is a risk that many long-term research endeavors 
may fail, wasting tremendous effort. 
Bestseller drugs are those that are frequently prescribed for 
common conditions like the flu, diabetes, high blood 
pressure, asthma, cold, etc. They are quite successful in the 
pharmaceutical sector and generate great annual revenues 
and daily profits. However, if the drug exhibits any side 
effects, it might also pose problems for the company. 
Drugs typically face competition from less-priced 
substitutes when their patents expire. As a result, finding 
new drugs is a difficult and risky process that is constantly 
driven by the potential good it could do for millions of 
individuals suffering from various ailments. The life cycle 
of drug development is shown in Figure 1. 


Fig 1. Drug Discovery and Development Life Cycle 


1. The first stage is target discovery. We now select the 
illness target upon which to focus drug development. The 
target helps us better understand how parasite infection 
affects genes, proteins, RNA, and other cellular 
components. 


2. Phase two entails verifying the accuracy of the intended 
target. During this stage, the discovered target is verified 
to confirm that the drug being developed addresses the 
right problem. 


3. Discovery of HITs is the third phase. In this step, we 
synthesize and purify the intended target-interacting 
chemical compounds. Chemists and assay developers work 
together to test the chosen substances at this step. 


4. The fourth stage is the Hit to lead transformation. This 
phase involves finding prospective lead compounds from 
the molecules discovered as part of High Throughput 
Screening (HTS) in the previous stage. 


5. The fifth stage is lead optimization. This phase is 
designed to provide a better and safer scaffold by 
minimizing structural alteration while eliminating the 
undesirable effects of the current active analogs. 


6. The sixth stage is pre-clinical studies. This comprises 
identifying medications and comprehending drug 
mechanisms for suitable patients, as well as applying 
biomarkers to increase the effectiveness of clinical trials. It 
clarifies the disease's activity and allows for more 
precise functional imaging of its response to the drug 
created to treat it. 


7. The following step, clinical trials, involves testing the 
drug on human subjects. If the medicine achieves its 
intended results, then the process is complete. 


8. Post-development Monitoring and pharmacovigilance 
are the process's last steps. Medical professionals may 
clinically prescribe the drug after it has been evaluated and 
given FDA approval. After that, the drug is put on the 
market for consumer purchase, and it needs to be 
monitored continuously. 


It takes many years to successfully complete each of these 
phases. Continuous research is being done to increase the 
efficiency and speed of this procedure. This paper is aimed 
at discussing and reviewing the various applications of 
cognitive sciences and machine learning in order to drive 
productive benefits for the drug development lifecycle in 
the pharma industry. The rest of the paper is organized as 
follows: Section II discusses the types of ML algorithms 
employed in different phases of drug development, with 
subsections describing the ML models used in each phase 
and their limitations. This section is then followed by a 
conclusion and a list of references. 


II. ML ALGORITHMS USED IN VARIOUS 
STAGES OF DRUG DEVELOPMENT 


Drug development has considerably advanced because of 
Machine Learning algorithms. Consequently, the use of 
multiple ML algorithms in drug discovery has significantly 
benefited pharmaceutical companies. ML algorithms have 
been used to construct many models for predicting the 
chemical, biological, and physical characteristics of 
compounds used in drug discovery. Over the duration of 
drug discovery, these trained models will become 
invaluable. Machine learning has been put to use in the 
pharmaceutical sector for a variety of purposes, such as 
drug efficacy identification, drug-protein interaction 
prediction, safety biomarker confirmation, and molecule 
bioactivity enhancement. Several ML methods have seen 
extensive usage in the pharmaceutical industry recently. 
These include the support vector machine (SVM), random 
forest (RF), and naive Bayes (NB). 


Figure 2 depicts the four main categories of machine 
learning algorithms: supervised, unsupervised, 
semi-supervised, and reinforcement learning [2][3]. 
Supervised learning requires input data together with the 
expected outputs. During the training phase, the model's 
predictions are compared against these expected outputs to 
track its accuracy. Before using the method on new test 
data, the features, instances, and models to be utilized 
must be established. Learning can be stopped once 
performance reaches an acceptable level. Supervised 
learning problems can be categorized as either 
classification or regression. Any problem where the output 
is a category, for example YES or NO, is a classification 
problem. A problem with a real-valued output, for instance 
height or weight, is a regression problem. Unsupervised 
algorithms, on the other hand, are trained without the 
intended result. They model the underlying distribution of 
the data via an iterative process, which gives a better 
understanding of the data. These problems are classified as 
clustering or association problems. Clustering aims to 
discover the inherent groupings in the data, while 
association aims to discover rules that describe large 
portions of the data. Semi-supervised learning employs 
input data in which only a subset is labeled. Such problems 
occur often in the real world, and researchers solve them 
using both supervised and unsupervised methods. In 
reinforcement learning, the observations derived from 
interaction with the environment are utilised. The 
reinforcement learning system repeatedly takes in new 
information from the environment until risk is reduced. To 
learn the behavior of the environment, it makes use of a 
feedback signal called a reinforcement signal. 


Fig 2. Types of Machine Learning Algorithm 


There are various machine learning algorithms [4][5]; some 
of the popular approaches are: 


• Decision Trees: This model uses data about various 
decisions and their respective potential outcomes to form a 
tree-like graph. By eliminating the low-value branches, 
pruning can help a tree function better. This minimizes 
both the over-fitting and the tree's complexity. 


• Naive Bayes Classification: These Bayes theorem-based 
classifiers are typically used when the inputs have a large 
dimensionality. Despite their simplicity, they can perform 
well in comparison to other, more complex models. 


• K-means Clustering: This technique groups the data, 
where K stands for the number of groups. It uses the given 
features to iteratively assign each data point to a group, 
so that the data is clustered in terms of shared 
characteristics. K-means clustering returns the data's 
labels and the cluster centres (centroids). 


• Logistic Regression: A method of statistical analysis used 
to identify the significance of one or more independent 
variables in a data set. It is used for prediction, with the 
goal of describing the association between a binary 
dependent variable and other independent variables. 


• Support Vector Machines: This technique aims to 
find a decision boundary for the training data that 
maximizes the separation between classes. For cases where 
the data is not linearly separable, a kernel function is 
used. 


• Neural Networks: These are parameterized non-linear 
algorithms that transform input data at each layer of a 
multi-layer perceptron. The accuracy of the model depends 
on the number of perceptrons and hidden layers and on 
their numerical values (weights). 
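
As an illustration of one of the approaches listed above, the following is a minimal sketch of K-means clustering written with the Python standard library only. The six two-dimensional points, the choice of the first K points as initial centres, and the fixed number of iterations are assumptions made for this example; they do not come from any study cited here.

```python
# Minimal K-means sketch in pure Python (toy 2-D data; all values hypothetical).
from math import dist


def kmeans(points, k, iters=20):
    """Return (labels, centroids) after alternating assignment and update steps."""
    # Assumption for this sketch: the first k points serve as initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(points, k=2)
print(labels)     # group index of each point
print(centroids)  # one centre per group
```

The function returns exactly the two outputs named in the K-means description: a label for each data point and the centre of each cluster. Production implementations usually add a better initialisation and a convergence check.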


2.1. Drug Discovery 


The application of ML at every step of drug development, 
from target identification and validation through hit 
discovery and hit-to-lead optimization to pre-clinical 
trials, provides a useful way to classify the literature 
under review. The drug design methodologies rely on 
datasets that were created using various ML algorithms. 
When ML algorithms are properly trained, verified, and 
used across the drug development phase, they may 
expedite error-prone, previously difficult procedures and 
provide insightful findings. The majority of drug design 
processes now incorporate ML approaches to cut down on 
time and manual interference, hence leading to optimal 
results and timelines. 


Fig 3. Machine Learning Models used in various stages of the Drug Discovery Process 


2.1.1 Target discovery 


The first step in the target identification and 
characterization procedure is to determine the function and 
significance of a gene or protein that may be used as a 
therapeutic target. After a target has been identified, the 
molecular pathways it is expected to affect may be 
described. Effectiveness, safety, and compliance with 
clinical and business needs are some of the qualities of a 
good target. 
Producing drugs (small molecules, peptides, antibodies, or 
more advanced techniques like short RNAs or cell 
therapies) that will change the disease state by modifying 
the activity of a biological target is the main objective of 
drug discovery [6]. The selection of a target with a valid 
therapeutic hypothesis, that is, that modifying the target 
would modify the disease state, is still important before 
beginning a drug development program, despite the recent 
revival of phenotypic screens. Target identification and 
prioritization is the process of choosing this target based 
on the information available. The next stage is to 
demonstrate the selected target's involvement in the illness 
using ex vivo and in vivo models that are physiologically 
relevant (target validation). Although the target is 
ultimately validated only in clinical trials, early target 
validation is essential to concentrate on high-probability 
projects. 


The pharmaceutical sciences devote considerable attention 
to the study of drug-target interactions (DTI) [7]. The 
procedure of discovering new medicines is costly and 
time-consuming. Therefore, the ability to anticipate drug- 
target interactions is useful to biologists since it allows 
them to focus their research. The first and most important 
step in the drug development process is determining the 
desired effects of the medicine. Druggable proteins that 
play a role in illness make up the majority of these areas of 
intervention. Drug-target interaction prediction is used to 
find novel treatment approaches. Potential targets include 
proteins such as enzymes, ion channels, G protein-coupled 
receptors (GPCRs), and nuclear receptors. Certain ligands 
may alter the functioning of these groups. As a result, 
studying the genomic space generated by these protein 
classes enables us to precisely predict the likelihood of an 
interaction. Drug discovery and drug repositioning for 
novel targets are both possible using DTI. The three main 
categories of DTI prediction tools are ligand-based, 
docking-based, and chemogenomic strategies. The 
ligand-based approach uses the similarity between the 
ligands of target proteins as a predictor of DTI [8][9]. The 
docking-based approach uses the target protein's 
three-dimensional structure to estimate the probability of 
an interaction, considering the relative stability and 
binding affinity of the proteins [10][11]. The 
chemogenomic approach considers the drug's chemical 
information, the protein's genomic data, and the known 
DTIs together [12][13]. In a 
ligand-based technique, a target with a small number of 
binding ligands frequently yields subpar DTI predictions. 
This is a shortcoming of this approach. Similarly, the 
docking-based approach is time-consuming and depends 
on the target proteins' 3D structures. Due to these 
drawbacks, the chemogenomic technique has recently 
gained popularity for the identification of DTI. The DTI 
problem is presented as a machine learning problem using 
this method, and a classifier is often created and trained 
using publicly accessible interaction data. This classifier 
is then used to predict the unknown interactions [13]. 
Chemogenomic methods have drawn on a number of 
techniques, including bipartite graphs [14], recommendation 
systems [15], and supervised classification [16]. However, 
when we look at the data, we can see that only a small 
number of interactions are known to be positive, and the 
other possible interactions are unknown. For instance, there 
could be only 7,000 known positive drug interactions out of 
35 million potential combinations [17]. The two types of 
computational chemogenomic methodologies are 
feature-based and similarity-based strategies. In 
feature-based methods, the inputs are features for a set of 
instances, each with a class label. In most cases, the targets 
are the features and the instances are the drugs. The binary 
class label represents whether a possible association is 
present. Support vector machines, decision trees, and 
random forests are a few examples of feature-based 
classification techniques [18]. Drug-target interactions are 
often identified using Support Vector Machine or Random 
Forest [19]. 
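
The ligand-based idea described above can be made concrete with a small sketch. It assumes that each compound is represented by a binary fingerprint (stored here as a set of "on" bit positions) and that similarity is measured with the Tanimoto coefficient; neither choice is prescribed by the works cited, and the ligand names, target names, bit patterns, and the 0.5 threshold are all hypothetical.

```python
# Ligand-based DTI sketch: a query compound inherits the target of its most
# similar known ligand. All fingerprints and annotations below are made up.


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical known ligands: name -> (fingerprint bits, annotated target)
known_ligands = {
    "ligand_A": ({1, 4, 7, 9}, "GPCR_X"),
    "ligand_B": ({2, 3, 5, 8}, "kinase_Y"),
    "ligand_C": ({1, 4, 6, 9}, "GPCR_X"),
}


def predict_target(query_bits, threshold=0.5):
    """Return (target, similarity); target is None if nothing is similar enough."""
    _, (bits, target) = max(
        known_ligands.items(), key=lambda kv: tanimoto(query_bits, kv[1][0])
    )
    sim = tanimoto(query_bits, bits)
    return (target if sim >= threshold else None), sim


target, sim = predict_target({1, 4, 7, 8})
print(target, sim)
```

The sketch also makes the stated shortcoming visible: a target with few known ligands contributes few reference fingerprints, so a query is less likely to find a close neighbour and the prediction falls below the threshold.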


Particularly, the target identification portion of the drug 
development process significantly relies on the 
categorization of biomedical data. Naive Bayes (NB) 
classifiers have shown excellent potential for classifying 
biomedical data, which is often replete with irrelevant 
information and data known as noise [20]. Lead discovery 
might be considerably enhanced by applying NB approaches 
to predict ligand-target interactions [21]. In recent years, 
researchers have been able to use NB strategies in many 
areas of the drug development process. In research aimed at 
finding new breast cancer therapies, Pang et al. [22] 
employed NB models and other methods to categorize 
compounds according to their potential efficacy as estrogen 
receptor antagonists. The model produced impressive results 
when used in conjunction with other techniques, such as the 
extended-connectivity fingerprint-6. In a study by Wei et 
al. [23], potential drugs that would be effective against the 
targets of the hepatitis C virus and human 
immunodeficiency virus type 1 were predicted using a mix 
of NB and support vector machine methods. Their 
approach included two distinct descriptor systems, one of 
which was the extended-connectivity fingerprint-6, with 
NB as a classifier technique. Utilizing NB in conjunction 
with other approaches and technologies has proven 
effective in implementing drug discovery processes. 
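
To show how an NB classifier can operate on fingerprint-style inputs, the following is a minimal sketch of a Bernoulli naive Bayes model with Laplace smoothing, written in pure Python. It is not the pipeline of Pang et al. or Wei et al.; the four-bit fingerprints, the "active"/"inactive" labels, and the smoothing constant are assumptions chosen only for illustration.

```python
# Bernoulli naive Bayes over binary fingerprints (toy data, hypothetical labels).
from math import log


def train_nb(X, y, alpha=1.0):
    """Fit per-class priors and per-bit 'on' probabilities with Laplace smoothing."""
    model = {}
    n_bits = len(X[0])
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        prior = len(rows) / len(X)
        # P(bit_j = 1 | class), smoothed so no probability is exactly 0 or 1.
        p_on = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_bits)
        ]
        model[cls] = (prior, p_on)
    return model


def predict_nb(model, x):
    """Return the class with the highest log-posterior for fingerprint x."""
    def log_posterior(cls):
        prior, p_on = model[cls]
        return log(prior) + sum(
            log(p if bit else 1.0 - p) for bit, p in zip(x, p_on)
        )
    return max(model, key=log_posterior)


X = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1],
     [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
y = ["active", "active", "active", "inactive", "inactive", "inactive"]

model = train_nb(X, y)
prediction = predict_nb(model, [1, 1, 0, 0])
print(prediction)
```

Because each bit contributes an independent term to the log-posterior, a few uninformative bits shift the score only slightly, which is one reason NB is often considered for noisy, high-dimensional inputs.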


2.1.2 Target validation 


The feasibility of creating a medicine for a certain target is 
also an important consideration for pharmaceutical 
companies. For instance, small-molecule drugs require 
targets whose properties imply that these proteins can bind 
small molecules [24]. Such druggability models can be 
created using various target attributes. Using the 
physicochemical, structural, 
and geometric characteristics of 1,187 drug-binding and 99 
non-drug-binding cavities in a sample of 99 proteins, 
Nayal and Honig [25] created a random forest classifier. 
The most important characteristics were the size and shape 
of the surface cavities. On the basis of the protein sequences 
of well-known drug and non-drug targets, some studies 
have used SVMs [26][27] or biased SVMs with stacked 
autoencoders, a Deep Learning model [28], to forecast 
druggable targets from different physicochemical 
properties. Additionally, it has been discovered that 
druggable proteins tend to be highly connected and occupy 
certain areas of protein-protein interaction 
networks [29][30][31]. These ML methods reduced the 
search space by generating lists of drug-binding targets, but 
further research is needed to confirm these predictions. 


The holy grail of target identification or validation, namely 
the ability to accurately anticipate the outcome of a drug's 
clinical trials in advance, has not yet been attained. 
Success indicators have been the subject of several non- 
ML studies. Rouillard et al.[32] evaluated 332 targets that 
were either successful or unsuccessful in phase III clinical 
trials by analyzing their omics data using ML and selecting 
multivariate characteristics. The gene expression data, 
which was characterized by low mean RNA expression 
and considerable heterogeneity across tissues, was shown 
to be a strong predictor of effective targeting. This work 
provided more evidence that optimal targets are expressed 
selectively in diseased tissues [33]. To predict de novo 
therapeutic targets, Ferrero et al. [34][35] trained a 
variety of ML classifiers utilizing target-disease 
associations from the Open Targets platform. It was 
determined that, regardless of indication, the three most 
essential data categories for therapeutic target prediction 
are gene expression, genetic data, and the availability of an 
animal model. The sparseness of the data and the lack of 
knowledge regarding the causes of failed programs, 
however, pose a limitation to this technique. 
Fundamentally, because it takes years to build a good drug 
discovery plan and finally bring it to market, successful 
programs are a reflection of earlier drug development 
models. Considering the arrival of more recent therapeutic 
modalities like biologics (including antibodies), it's 
doubtful that the elements that contributed to the success 
of small-molecule projects in the past will be relevant in 
the present. Additional restrictions are imposed by 
precision medicine's growing importance. For future 
prediction tools to be effective, large amounts of 
information on both successful and failed drug 
development projects must be made accessible with 
metadata in the public domain. 


2.1.3 Hit Discovery 


It is essential to execute comprehensive virtual and 
experimental high-throughput screening of large chemical 
pools to identify treatment candidates that inhibit or 
activate the target protein of interest. The 
pharmacodynamic, pharmacokinetic, and toxicological 
characteristics of candidate structures are further 
improved, as well as their target specificity and selectivity. 
It is important to note, however, that there may be a lack of 
enough high-quality data in this domain, which might limit 
the use of ML to new chemistry. This is especially true for 
macrocycles and proteolysis-targeting chimaeras 
(PROTACs). 


For ligand-based virtual screening, a lot of attention has 
been paid to the use of Deep Learning models, such as 
multi-task neural networks. Computational methods may 
use a particular lead molecule to identify structurally 
comparable compounds with similar chemical 
characteristics. Multi-task DNNs have been shown to be 
more effective for this task than the standard statistical 
approaches previously used. When it comes to predicting 
the properties and functionalities of small molecules, 
DNNs may greatly increase predictive power [37]. One-shot 
learning may significantly reduce the amount of data 
needed to accurately forecast a molecule's readout in a new 
experimental setting. The binding mechanism of opiates to 
the μ-opioid receptor was previously unknown; however, a 
Markov State model and Machine Learning approaches 
were able to pinpoint an allosteric region implicated in this 
activation [38]. The advantages of multi-task models over 
single-task models, however, vary depending on the data 
set. MoleculeNet [39], a large benchmarking data set 
produced by Pande et al., has been used to assist in the 
comparison of different ML algorithms. Data on the 
characteristics of more than 700,000 molecules can be 
found in MoleculeNet. The open-source DeepChem package 
now includes all of these hand-picked data sets along with 
a number of additional advantages. 


Planning effective chemical synthesis routes can also be 
done using DNNs and contemporary tree search 
techniques. A target molecule is formally deconstructed 
using reversed processes in order to plan its production 
(retrosynthesis). In order to synthesise the target, this 
method creates a sequence of processes that may be carried 
out in a straightforward way in the laboratory. The 
systematic application of synthetic chemistry skills to this 
method is a tremendous task. The exponential growth of 
chemical knowledge and the inadequate understanding of 
the range and boundaries of many reactions have made the 
manual insertion of transformation rules impracticable. A 
database called Reaxys (with 11 million reactions and 
300,000 rules) was utilised by Segler et al. [40] to 
automatically extract the rules. They combined a Monte 
Carlo tree search (MCTS) with DNNs that weight the tree's 
nodes in order to determine the most promising paths for 
further exploration. In quantitative analyses, this strategy 
outperforms the standard best-first search in two different 
implementations (heuristic and neural). Furthermore, for 
around two-thirds of the examined chemicals, MCTS is 30 
times quicker than conventional computer-aided search 
methods. Qualitative evaluations were also incorporated in 
a double-blind experiment, in which organic chemists were 
required to pick between predicted and literature-based 
synthesis routes. For the first time, chemists judged the 
predicted routes' quality to be, on average, on par with 
routes selected from the literature. 
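
The recursive deconstruction that underlies retrosynthesis can be sketched in a few lines. The example below is deliberately much simpler than the approach of Segler et al.: it uses a plain depth-first search instead of MCTS, has no neural network, and relies on a hand-written rule table. The molecule names, the rules, and the set of purchasable building blocks are all hypothetical.

```python
# Toy retrosynthesis: recursively break a target into precursors until every
# leaf is a purchasable building block. Rules and names are hypothetical.

# Each product maps to alternative precursor sets (one per "reversed reaction").
RETRO_RULES = {
    "target": [("intermediate_1", "reagent_a"), ("intermediate_2",)],
    "intermediate_1": [("reagent_b", "reagent_c")],
    "intermediate_2": [("unavailable_x",)],
}
PURCHASABLE = {"reagent_a", "reagent_b", "reagent_c"}


def plan(molecule, depth=0, max_depth=5):
    """Return a list of (product, precursors) steps, or None if no route exists."""
    if molecule in PURCHASABLE:
        return []  # nothing to synthesise
    if depth >= max_depth:
        return None
    for precursors in RETRO_RULES.get(molecule, []):
        steps = []
        for precursor in precursors:
            sub_route = plan(precursor, depth + 1, max_depth)
            if sub_route is None:
                break  # this disconnection fails; try the next rule
            steps.extend(sub_route)
        else:
            # Every precursor is reachable: this step comes after its sub-steps.
            return steps + [(molecule, precursors)]
    return None


route = plan("target")
for product, precursors in route:
    print(" + ".join(precursors), "->", product)
```

The returned steps are listed in the forward direction, so they can be read as a laboratory sequence. In a realistic setting the number of applicable rules per molecule is very large, which is why guided searches such as MCTS are used instead of exhaustive enumeration.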


2.1.4 Hit to Lead 


This procedure is sometimes referred to as "lead 
generation" in the early phases of drug development 
research. Molecules, or "hits," from the High Throughput 
Screen (HTS) undergo limited optimization to find 
potential lead compounds. Using a preexisting kinase 
inhibitor library, the "design layer"/Random Forest 
regression mapping method is used to construct new 
chemical spaces with biological activity. This method of 
optimising hits into leads is a practical use of chemical 
synthesis [41]. 


By adjusting or rebalancing the target interest, de novo 
drug design produced distinct chemical structures [42]. By 
starting from scratch, de novo techniques introduce new 
molecules using a fragment-based methodology. If the 
molecular structure is impractical or overly complex at 
this stage [43], its development becomes risky and the 
evaluation of its bioactivity becomes challenging. To 
develop a novel structure with the necessary properties, 
deep learning models can be used for their extensive 
learned knowledge and generative capabilities [44]. The use 
of reinforcement learning in molecular de novo design is 
another significant application of Machine Learning. By 
tuning a sequence-based generative model to produce 
molecules with almost ideal values for solubility, 
pharmacokinetic characteristics, bioactivity, and other 
factors, researchers at AstraZeneca were able to expand the 
chemical space. Similar models were created by Kadurin et 
al. utilizing deep GANs to extract chemical features from 
very large data sets [45]. It's important to 
keep in mind, however, that reinforcement learning may 
not be helpful when trying to find previously undiscovered 
synthetic pathways. 


Olivecrona et al. [46] extended the use of deep 
reinforcement learning to the prediction of biological 
activities in the creation of new drugs, with some 
modifications to the RNN model. An RNN model must first 
be trained to learn the SMILES syntax; ChEMBL compounds 
may be gathered for this training. In reinforcement 
learning, agents take actions under certain rules. If the 
agent is rewarded enough, the tendency toward those actions 
is reinforced [47]. An SVM built on the known ligands in 
the training set was used as the activity scoring function 
to provide the reward. The RNN and deep reinforcement 
learning model was then used to generate compounds that 
are antagonistic to the dopamine receptor type 2. 
Additionally, it was noted that, with the SVM's scoring 
capability, more than 95% of the structures were predicted 
to lie in the bioactive region. Autoencoders can also be 
used to produce unique molecules with deep learning. 
Gomez-Bombarelli et al. [48] used the multilayer 
perceptron (MLP) and variational autoencoder (VAE) to 
automatically produce new molecules with the required 
characteristics. 


Kadurin et al. [49] used the adversarial autoencoder (AAE)
model, now known as druGAN, to create molecular
fingerprints. Compared with the VAE model, the AAE
approach produced impressive results in terms of
generation capacity, reconstruction error, and
feature-extraction effectiveness. Coley et al. [50]
proposed assessing whether a generated molecule is
synthetically accessible. Relying on the strong
approximation capability of neural networks to produce a
synthetic complexity measure, they trained a neural
network on a reaction database. The premise is that, for
a productive synthetic reaction, the complexity score of
the product should be higher than that of the reactants
[51]. To enforce this inequality between the complexity
of products and reactants, Coley et al. developed a
scoring function by encoding chemical reactions as pairs
of reactants and products. The neural network was trained
on the 22 million reactant-product pairs that Coley et
al. used. As a consequence, compounds toward the end of a
synthesis receive the highest complexity scores. Finally,
combining generative models with such a measure helps to
eliminate unrealistic molecules, so that both synthetic
complexity and pharmacological activity are taken into
account in retrosynthetic planning.
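
The pairwise inequality between product and reactant scores can be sketched in a few lines of Python. This is only an illustration, not the implementation of Coley et al.: the linear scorer, the descriptor vectors, the weights, and the margin value are all assumptions made for the example. The loss is zero only when the product outscores the reactant by at least the margin.

```python
def complexity_score(features, weights):
    """Hypothetical linear scorer standing in for the trained neural network."""
    return sum(f * w for f, w in zip(features, weights))


def pairwise_hinge_loss(reactant_feats, product_feats, weights, margin=0.25):
    """Penalise a reaction pair unless score(product) >= score(reactant) + margin."""
    s_reactant = complexity_score(reactant_feats, weights)
    s_product = complexity_score(product_feats, weights)
    return max(0.0, margin + s_reactant - s_product)


weights = [0.5, 1.0]    # assumed model parameters
reactant = [1.0, 0.0]   # toy descriptor vectors, not real chemistry
product = [2.0, 1.0]

loss_ok = pairwise_hinge_loss(reactant, product, weights)    # product scores higher
loss_bad = pairwise_hinge_loss(product, reactant, weights)   # ordering violated
print(loss_ok, loss_bad)
```

Minimising such a loss over many reaction pairs pushes the scorer toward assigning higher complexity to products than to their reactants.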


How to adequately describe the chemical structure is a
challenge in small-molecule design. There are several
representations to select from, ranging from simple
circular fingerprints like the extended-connectivity
fingerprint (ECFP) to intricate symmetry functions [52].
Which structural representation is best for each
small-molecule design challenge is not yet clear. It would
be fascinating to see whether the extensive body of ML
research in cheminformatics provides any new information
on the most efficient approach for structural
representation.


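
To make the idea of a circular fingerprint concrete, the sketch below builds a toy ECFP-like bit set from a hand-written molecular graph. It is a simplified illustration rather than the real ECFP algorithm: the atom identifiers use only the element symbol, duplicate-substructure removal is omitted, and the function name, hash choice, and bit length are assumptions.

```python
import zlib


def _stable_hash(obj):
    """Deterministic 32-bit hash (Python's built-in hash() is salted per run)."""
    return zlib.crc32(repr(obj).encode("utf-8"))


def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy ECFP-like fingerprint: returns the set of 'on' bit positions."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Iteration 0: each atom is identified by its element symbol only.
    ids = [_stable_hash(symbol) for symbol in atoms]
    features = set(ids)

    # Each further iteration folds in the identifiers of the direct neighbours,
    # so an identifier after k rounds describes a neighbourhood of radius k.
    for _ in range(radius):
        ids = [
            _stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbours[i]))))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the collected identifiers into a fixed number of bit positions.
    return {f % n_bits for f in features}


# Heavy-atom graphs of ethanol (C-C-O) and propane (C-C-C), written by hand.
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propane = circular_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
shared = len(ethanol & propane) / len(ethanol | propane)  # Jaccard/Tanimoto overlap
print(sorted(ethanol), sorted(propane), round(shared, 2))
```

Because each identifier depends only on an atom's element and its neighbours, renumbering the atoms leaves the fingerprint unchanged.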
2.1.5 Lead Optimization


The optimization of potential drug leads is an important
step in the drug development process. If a fragment shows
potential for use in medicinal chemistry, it is evaluated
as a candidate for the next step in the research process.
Lead optimization aims to offer a better and safer
scaffold by modifying the structure of existing active
analogues and removing their negative effects. An
illustration of this is the advancement of autotaxin
inhibitors, such as the investigational drug GLPG1690,
into human clinical trials to treat pulmonary fibrosis.
Figure 4 below provides an overview of the factors that
can be tuned to make active analogues more potent. Here,
we evaluate a substance's ADME/T features, including its
toxicity, chemical and physical properties, and rates of
absorption, distribution, metabolism, and excretion.


Fig 4. Factors Affecting Lead Optimization (figure labels:
ADME features; chemical and physical properties)


• Chemical and physical properties


Physical and chemical properties have been used in the
drug development process to reduce the number of
significant failures. To this end, researchers have turned
to lead optimization strategies powered by deep learning
models [53]. With interpretability as a guiding
principle, Duvenaud et al. [54] extracted features
directly from the molecular graph using a CNN-ANN
approach to generate predictions (MAE = 0.53 ± 0.07).
Duvenaud's study in turn motivated the efforts of Coley
et al. to improve aqueous solubility prediction. With
their tensor-based convolutional approach, the results
improved to MAE = 0.424 ± 0.005.
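
Results of the form "MAE = mean ± spread" can be computed as in the sketch below. It assumes that the ± term is the standard deviation of the MAE across repeated folds, which the text does not state explicitly; the fold data are invented log-solubility values used only for illustration.

```python
from statistics import mean, stdev


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical (measured, predicted) log-solubility values for three folds.
folds = [
    ([-2.0, -3.5, -1.0], [-2.4, -3.0, -1.3]),
    ([-4.0, -0.5, -2.5], [-3.5, -1.0, -2.0]),
    ([-1.5, -3.0, -2.0], [-1.9, -3.6, -2.5]),
]

fold_maes = [mean_absolute_error(t, p) for t, p in folds]
print(f"MAE = {mean(fold_maes):.3f} ± {stdev(fold_maes):.3f}")
```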


Clearly defining the attributes of the molecular graph is
crucial, since tensor-based approaches must incorporate
both bond-level and atom-level properties. To predict
aqueous solubility, Coley's model utilised considerably
more atom-level information than Duvenaud's [55]. In the
evaluation of pharmacokinetic parameters, Caco-2
permeability coefficients (P app) were shown to correlate
well with oral drug absorption, which makes them useful
for assessing candidate drugs [56, 57]. Using Caco-2
permeability data, Wang et al. [58] built prediction
models from 30 descriptors on a set of 1,272 compounds,
including models such as SVM regression and boosting. The
boosting model fared best on the test set, where it
showed the highest predictive capacity. Because it
adheres to the QSAR principles set out by the OECD
(Organisation for Economic Co-operation and
Development), it meets that body's standards for
reliability and sound justification.


• Absorption, distribution, metabolism, and excretion


Absorption describes how an administered drug or
treatment reaches the bloodstream. The bioavailability
parameter is used to quantify the extent of absorption.
Several groups have described how molecular predictions
of bioavailability can be used to improve absorption
properties [59]. Tian et al. used 1,014 compounds to
predict bioavailability from molecular properties and
structural fingerprints with a multiple linear regression
(MLR) model. Applying the genetic function approximation
gave good predictive performance, with RMSE = 0.2355 and
a correlation value of 0.71.


The distribution of drugs or treatments within the human
body is influenced by intracellular and interstitial
fluids as well as by drug-specific absorption
characteristics [60]. The steady-state volume of
distribution (VDss) relates the amount of drug in the
body in vivo to its concentration in plasma, and it is a
crucial indicator for evaluating how a drug is
distributed. Lombardo and Jing used 1,096 molecules and
the PLS and Random Forest methodologies to make
predictions about VDss [61]. The prediction results were
not fully satisfactory, since only 50% of the compounds
were predicted within a twofold error. VDss may be
affected by factors that are not yet well understood,
which challenges the prediction of VDss from molecular
structure data alone.


Because of the redundancies built into the metabolic
system, a drug or treatment taken by a person may be
converted into toxic metabolites. Metabolic stability is
therefore important, and structural optimization relies
on accurate predictions of metabolism. Numerous machine
learning (ML) methods have been used to predict
metabolism by specific metabolic enzymes, such as
UDP-glucuronosyltransferases (UGTs) and cytochrome
P450s, using a vast quantity of drug metabolism data. In
addition, the XenoSite platform offers neural networks
trained on UGT data for predicting UGT metabolism
[62][63][64].


Excretion is the process by which a drug and its
metabolites are eliminated from the body. Water-soluble
drugs may be excreted directly without being metabolised,
whereas other drugs are eliminated in the form of their
metabolites. Lombardo et al. applied the PCA method in a
novel approach and obtained good results, with a reported
prediction accuracy of 84% [65].
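
The evaluation measures quoted in this subsection (RMSE, a correlation value, and the share of compounds predicted within a twofold error) can be computed as sketched below. The VDss values are invented for illustration, and the function names are assumptions rather than part of any cited study.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fraction_within_fold(observed, predicted, fold=2.0):
    """Share of compounds whose predicted/observed ratio lies in [1/fold, fold]."""
    hits = sum(1 for o, p in zip(observed, predicted) if 1.0 / fold <= p / o <= fold)
    return hits / len(observed)


# Hypothetical VDss values in L/kg (observed vs. predicted), for illustration only.
observed = [0.5, 1.0, 2.0, 4.0]
predicted = [0.6, 2.5, 1.5, 1.0]

print(rmse(observed, predicted))
print(pearson_r(observed, predicted))
print(fraction_within_fold(observed, predicted))
```

In this toy data set two of the four predictions fall within a twofold error, which mirrors the kind of summary reported for the VDss models.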


• Toxicity and the ADME/T multi-task neural networks


Toxicity accounts for the failure of about one-third of
candidate compounds in preclinical and clinical
development. Improved toxicity prediction and molecular
optimization reduce this risk [66]. Kidney and liver
toxicity profiles are among those that may be predicted
using tools like structural alerts and rule-based expert
knowledge. To improve the accuracy of toxicity
predictions, deep learning models are required. Xu et
al. built on CNN-based molecular graph encoding to create
an acute oral toxicity prediction model (MGE-CNN), whose
predictions were shown to be better than those of an SVM
model [67]. The success of the MGE-CNN model was
attributed to the fact that molecule encoding, feature
extraction, and model construction are trained together
within the neural network. Owing to the flexibility of
the MGE-CNN model, the problem could also be reformulated
in terms of molecular fingerprints. Xu et al. [68] used
toxicity-related fingerprint features to match ToxAlerts
entries and to derive high-quality structural alerts.
Multi-task neural networks, which are trained on related
tasks and learn shared features, outperform single-task
neural networks [69]. This is because the tasks share
parameters and thereby support one another. Predictions
of absorption, distribution, metabolism, excretion, and
toxicity have likewise been improved using multi-task
neural networks. Kearnes et al. compared single-task and
multi-task performance using experimental ADME/T data,
and the results demonstrated that the multi-task approach
was superior [70].
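
The parameter sharing that distinguishes a multi-task network from a set of single-task networks can be illustrated with a small forward pass. The sketch below is untrained and uses random weights; the class name, layer sizes, task names, and input descriptor are hypothetical, and it does not reproduce the architecture of any cited study.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


class MultiTaskNet:
    """One shared hidden layer feeding a separate output head per task."""

    def __init__(self, n_inputs, n_hidden, task_names, seed=0):
        rng = random.Random(seed)
        self.shared_w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.shared_b = [0.0] * n_hidden
        self.heads = {
            name: ([rng.uniform(-1, 1) for _ in range(n_hidden)], 0.0) for name in task_names
        }

    def predict(self, descriptor):
        # The hidden representation is computed once and reused by every task head.
        hidden = relu(dense(descriptor, self.shared_w, self.shared_b))
        return {
            name: sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)
            for name, (weights, bias) in self.heads.items()
        }


# Task names and the 4-value descriptor are hypothetical placeholders.
net = MultiTaskNet(n_inputs=4, n_hidden=8, task_names=["absorption", "metabolism", "toxicity"])
preds = net.predict([0.2, 1.0, 0.0, 0.5])
print(preds)
```

During training, gradients from every task would update the shared layer, which is how related endpoints can inform one another.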


2.1.6 Pre-Clinical Studies 


ML-based biomarker discovery increases the effectiveness
of clinical trials by identifying the patients for whom a
drug is suitable and by clarifying how the drug works
[71][72][73]. Completing a clinical trial requires a
great deal of money and time. To address this problem,
predictive models must be developed, validated, and
applied in the early stages of clinical trials. Applied
to preclinical data, ML systems enable the prediction of
translational biomarkers. Following validation, the
corresponding biomarkers and models may be used to assess
a patient's indications and suggest a treatment strategy.
Although many scholarly articles have proposed predictive
models and biomarkers, only a few of them have actually
been implemented in clinical trials. For clinical use, it
is necessary to consider model development, design, data
access, data quality, software, and model selection.
Community-driven efforts have evaluated how well ML
methods perform when used to create regression and
classification models. Several years ago, the US Food and
Drug Administration-led MAQC II (MicroArray Quality
Control II) project [74] analyzed ML algorithms for
predicting clinical endpoints from gene expression data.
Thirty-six independent teams examined six microarray data
sets and created predictive models for classifying
clinical endpoints. The findings point to good practice
for clinical prediction, which depends on high-quality
data, trained scientists, and quality control procedures.
In one example, patients with multiple myeloma were
assigned a poor prognosis on the basis of a 24-month
cut-off. Because survival and gene expression are
continuous variables, a regression-based method may be
better suited to predicting outcomes. A gene expression
profile may be utilized in combination with Cox
regression models to identify a patient's risk factors,
as has been shown [75]. The advantage of regression
models emphasized here is that they do not require
predefined classes in order to make predictions in
clinical trials [76][77][78][79]. The National Cancer
Institute (NCI) organised a challenge on drug-sensitivity
prediction models in order to assess regression models
[80]. Models were trained, with their key parameters
selected, on training data (for instance, 35 breast
cancer cell lines treated with 31 drugs) and were then
validated on blinded test data of the same kind (i.e., 18
breast tumour cell lines treated with the same 31 drugs).
Better prediction algorithms are created using data from
six different data profiles: RNA sequencing, RNA
microarray, reverse phase protein array, SNP (Single
Nucleotide Polymorphism) array, DNA methylation status,
and exome sequencing. These profiles are used to conduct
multivariate statistical analyses on 44 sets of data
using a variety of regression models, including sparse
linear regression, kernel methods, regression trees, and
principal component analysis. The MAQC II findings showed
that certain teams performed very well, while other teams
that utilized similar models did not. While some teams
concentrated on technical issues like feature selection,
quality control, data reduction, tuning of ML parameters,
and data-splitting strategy, others utilized biological
information like gene expression data to set themselves
apart from the competition. Compared with other
approaches, this makes it feasible to build prediction
models for a large number of drugs.
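
For readers unfamiliar with Cox regression, the quantity that such a model minimises can be sketched as follows. This is a bare-bones illustration of the negative log partial likelihood, assuming no tied event times; the toy survival times, event flags, and single covariate (standing in for a gene expression value) are invented for the example.

```python
import math


def cox_neg_log_partial_likelihood(times, events, covariates, beta):
    """Negative log partial likelihood of a Cox model (no special tie handling)."""
    # Linear predictor (log relative hazard) for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            continue  # censored patients only contribute through the risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total -= eta[i] - math.log(risk_sum)
    return total


# Hypothetical data: survival time in months, event flag (1 = event observed),
# and one expression-like covariate per patient.
times = [2.0, 4.0, 6.0]
events = [1, 0, 1]
covariates = [[1.0], [0.0], [0.0]]

baseline = cox_neg_log_partial_likelihood(times, events, covariates, beta=[0.0])
fitted = cox_neg_log_partial_likelihood(times, events, covariates, beta=[1.0])
print(baseline, fitted)
```

A lower value for the second coefficient indicates that assigning a higher hazard to the patient with the early event fits these toy data better than assuming no effect.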


2.2. Clinical Trials 


Clinical drug development proper follows the completion
of preclinical research and includes investigations with
human volunteers and clinical trials to further refine the
drug for human use. The intricacy of clinical trial
design, the cost of conducting such a study, and the
difficulties inherent in putting it into practice are all
factors that may affect trials performed at this stage.
Trials must be safe and effective, be completed within
the budget allotted for drug development, and adhere to a
defined methodology to ensure that the medicine is useful
and practical for its intended application. For this
rigorous process to be successful, it needs to be properly
set up and to involve a substantial volunteer base. ML
algorithms have been used extensively in each of the
clinical trial tasks shown in Figure 5, thereby aiding
the process as a whole.


Fig. 5 Various Tasks associated with Clinical Trials
(figure labels include: clinical trial participant
management; study data collection, verification, and
surveillance)


2.2.1 Clinical study protocol optimization 


The success and efficiency of human clinical trials may
be improved by using ML to support the design of trial
protocols in advance, applying simulation techniques to
large amounts of data from prior studies. Study
simulation, for instance, may optimize the selection of
treatment regimens for trials, as shown by reinforcement
learning approaches for Alzheimer's disease and non-small
cell lung cancer [81, 82]. Researchers may also submit
protocols to AI tools that employ natural language
processing in order to detect potential obstacles to
successfully completing trials (such as inclusion/exclusion
criteria or outcome measures). Although the use of ML in
research planning may in theory ensure that a particular
trial design is best suited to the needs of the
stakeholders, this remains only a promise, since the
effectiveness of these example models has not been
assessed in peer-reviewed work. In conclusion, machine
learning clearly has the potential to improve the
effectiveness and productivity of preclinical research and
the planning of clinical trials. However, the great
majority of peer-reviewed studies on the use of ML in this
context focus on preclinical research and development
rather than on the planning of clinical trials. This could
be because more large, high-dimensional datasets are
available in translational contexts, or because using ML
in clinical trial settings comes with higher costs, risks,
and regulatory requirements. Peer-reviewed research on
the effectiveness of ML in clinical trial design is needed
to address these challenges.


2.2.2 Clinical trial participant management 


Clinical trial participant management involves selecting
study populations, recruiting patients, and retaining
participants. Despite significant investment in
participant management, studies often run over budget,
take longer than expected, or fail to provide useful data
due to patient drop-out and non-adherence. Estimates
indicate that between 33.6 percent and 52.4 percent of
the phase 1-3 clinical trials underpinning drug
development fail to proceed to the next phase, leaving an
overall chance of 13.8 percent that a drug evaluated in
phase I reaches approval [83]. ML techniques can help
with choosing the study population and with participant
identification, recruitment, and retention. If
individuals were more carefully selected for trials, the
sample size required to detect an effect might be
smaller. Put differently, improved techniques for
selecting the patient population may lead to fewer
individuals being offered therapies that are unlikely to
improve their outcomes. Previous studies have indicated
that, for the most commonly prescribed drugs, there are
anywhere from three to twenty-four non-responders for
every expected responder, making progress in this field a
continual challenge. Many people who take these drugs
thus experience side effects rather than the intended
benefit [84]. Unsupervised machine learning may quickly
analyze large databases of existing research, aid patient
population selection, and reveal patterns in patient
features that may be utilized to choose the patient
phenotypes best suited for treatment [85]. A cross-modal
inference learning model may match patients to trials
more successfully using EHR data by concurrently encoding
enrollment criteria (text) and patient records (tabular
data) into a shared latent space [86]. Mendel.AI and
Deep6AI are two businesses that provide comparable
services, although the utility of such offerings is
called into question by the lack of peer-reviewed
documentation of their development and performance
measures [87]. This method may have the benefit of not
requiring participants to be identified precisely by
structured data fields, a practice that has been
demonstrated to skew trial populations dramatically
[88, 89].


There are two basic strategies for using ML models to
boost retention and protocol adherence. The first is to
use ML to identify participants who are at high risk of
non-compliance with the study protocol and to intervene
early. The second is to use ML to make the study less
burdensome for participants and to improve their overall
experience. AiCure is a company that employs face
recognition technology to track whether or not patients
actually take their medication. In studies of patients
with schizophrenia and of recent stroke survivors
receiving anticoagulation, AiCure was shown to be more
effective than modified directly observed therapy in
detecting and improving patient adherence [90, 91].
Because AiCure's model development and validation
procedures are not publicly available, its performance
may differ across patient subgroups, as has been shown
for other computer vision applications [92].


Additionally, data obtained during routine clinical care
might be processed using ML techniques to provide data
that can be utilized for research. For instance, rather
than exposing all participants to the additional burden
and cost of more in-depth, multiplexed imaging,
generative adversarial network modeling of slides stained
with standard clinical hematoxylin and eosin may identify
the participants who need it [93]. Natural Language
Processing, often used in combination with the Unified
Medical Language System, may also make it simpler to
repurpose clinical data for research by automatically
filling out study case report forms [94]. Patients also
produce useful content outside the clinical trial context
that ML can process into study data, lessening the burden
of data collection for trial participants. Two examples
are natural language processing of social media posts,
which can identify serious drug reactions with high
fidelity [95], and wearable device data, which have been
found to correlate with the International Parkinson and
Movement Disorder Society's Unified Parkinson's Disease
Rating Scale and can also be used to identify participant
activity, detect patient falls, and distinguish between
patterns of neuropsychiatric symptoms [96]. In summary,
ML and NLP have shown promise for a number of tasks
related to improved participant management in clinical
trials; nevertheless, additional research comparing
various approaches to participant management is required
to further improve clinical trial quality and participant
experience.
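
The patient-trial matching step described above, in which enrollment criteria and patient records are placed in a shared latent space, can be sketched as a nearest-neighbour ranking. The embeddings below are hand-written three-dimensional vectors used purely as an assumption; in practice they would come from learned encoders for text and tabular data, which are not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_patients(criteria_embedding, patient_embeddings):
    """Return patient ids ordered from best to worst match for one trial."""
    scored = [
        (cosine_similarity(criteria_embedding, emb), pid)
        for pid, emb in patient_embeddings.items()
    ]
    return [pid for _, pid in sorted(scored, reverse=True)]


# Hypothetical embeddings: one for the trial's enrollment criteria and one per
# patient record, all assumed to live in the same latent space.
trial_criteria = [0.9, 0.1, 0.0]
patients = {
    "patient_a": [0.8, 0.2, 0.1],
    "patient_b": [0.0, 0.1, 0.9],
    "patient_c": [0.5, 0.5, 0.0],
}
ranking = rank_patients(trial_criteria, patients)
print(ranking)
```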


2.2.3 Data collection and management 


Applying ML to clinical trials has the potential to enhance
the methods used to gather, handle, and analyse trial data.
ML techniques can also aid in addressing some of the
challenges related to collecting real-world data and
dealing with the associated missing data. Data on
patients' health from wearables and other
mobile/electronic devices may supplement or even replace
data collected via more conventional means, such as
in-person study visits. Wearables and other devices may
also make possible the use and validation of new,
patient-centered biomarkers. ML processing is often
necessary when creating new "digital biomarkers" from the
data acquired by a device's numerous sensors (such as
cameras, audio recorders, accelerometers, and
photoplethysmographs), since the data provided by mobile
devices might be sparse and inconsistent in quality,
availability, and synchronisation. Therefore, in order to
analyse the massive and complicated data created by
wearables and other devices, appropriate data collection,
storage, validation, and analysis procedures are required
[97]. For example, accelerometer data from patients with
atopic dermatitis were processed using a recurrent neural
network [98], the input from a mobile single-lead
electrocardiogram platform was processed using a deep
neural network, and an audio signal from a patient with
Parkinson's disease was processed using a random forest
model [99]. These novel digital biomarkers might make
clinical studies run more smoothly and with a greater
focus on patients, but there are risks associated with
this strategy. Using machine learning to process wearable
sensor output in order to define study endpoints involves
the possibility of producing false results, as was shown
to happen with an electrocardiogram classification model
[100], although this risk exists for all data regardless
of processing technique. Obstacles to implementing ML
processing of device data include a limited understanding
of participants' attitudes toward privacy in relation to
the sharing and use of device data, as well as the lack
of a precise description of the overlap between approved
clinical endpoints and patient-centric digital
biomarkers.


2.2.4 Study data collection, verification, and 
surveillance 


An intriguing use of ML is in automating data collection
into case report forms, which may reduce time, cost, and
human error in both prospective trials and retrospective
evaluations. Natural Language Processing is particularly
important for this kind of data management. Depression
[101], epilepsy [102], and cancer [103] are just a few
examples of diseases where this application has shown
early promise despite having to overcome varied data
formats and provenances. Regardless of the method used
for data collection, ML might support risk-based
monitoring techniques for clinical trial surveillance.
This allows for the avoidance or early detection of
issues like site failure, fraud, and inconsistent or
nonsensical data that might otherwise delay database lock
and subsequent analysis. For instance, when case report
forms are filled out by hand (and usually supplied in PDF
form), the quality of the data collected for outcome
determination may be evaluated by combining optical
character recognition with natural language processing.
Clinical trials and observational studies may also
benefit from auto-encoders, which can be used to identify
potentially fraudulent data patterns by classifying them
as plausible or implausible [104]. Endpoint detection,
adjudication, and safety signal detection are all
examples of how machine learning may be used in data
processing. Currently, events are manually adjudicated by
a committee of physicians. However, semi-automated
endpoint identification and adjudication may save time
and money and reduce complexity. Although adjudicating
endpoints has historically required a significant amount
of human labour, sorting events into meaningful
categories lies well within the capabilities of ML
systems. IQVIA Inc. has described the capacity to
automatically process certain adverse events related to
drug therapy using a combination of optical character
recognition and natural language processing, although
this technique has not been peer-reviewed [105]. Because
endpoint definitions and the data needed to support them
often change between studies, a classification model
would in theory need to be retrained for each new trial,
which is not a viable approach. This might be a roadblock
in the way of fully automated event adjudication. There
have been recent attempts to standardize outcomes in the
area of cardiovascular research, although not all studies
adhere to these standards, and trial data have for the
most part not been pooled to enable model training for
cardiovascular endpoints [106]. For this area to advance,
stakeholders must establish consensus definitions,
actually adopt those event definitions, and be prepared
to provide the appropriate data from several trials for
model training.


The issue of missing data may be addressed using different
ML applications. The appropriate choice depends on the
data's context, the assumptions made about the data and
the objectives pursued, the methods used to acquire the
data, and the analyses that will be conducted. Goals
could include directly calculating precise estimates of
the missing covariate values, or computing other
quantities of interest by averaging over a large number
of potential values drawn from a learned distribution.
Although more modern approaches are still in their
infancy and thorough comparisons are needed, preliminary
studies show that complex ML methods may not always be
superior to simple imputation strategies such as the
population mean estimate [107]. One use of missing value
algorithms is the analysis of sparse datasets like those
found in registries, electronic health records, ergonomic
studies, and data collected from wearable devices
[108][109]. Data augmentation solutions may mitigate the
effects of missing data or values, but they should be
used with caution lest they lead to models that
generalize only partly to newly collected data, which are
inherently imperfect. Therefore, using ML to improve data
collection during the conduct of the study itself might
be a more fruitful approach.
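
The simple baseline mentioned above, imputation with the population mean, can be written in a few lines. The sketch assumes that missing values are encoded as None and that every column has at least one observed value; the patient records are invented for illustration.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in each column.

    Assumes every column contains at least one observed (non-None) value.
    """
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical records: (age in years, systolic blood pressure in mmHg).
records = [
    [54, 130],
    [61, None],
    [None, 120],
    [47, 140],
]
completed = impute_column_means(records)
print(completed)
```

Any more complex imputation model would need to outperform this baseline on held-out data to justify its additional complexity.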


2.2.5 Data analysis 


Data collected in clinical trials, registries, and
clinical practice are rich sources of information for
study design, risk modeling, and counterfactual
simulation. These tasks are ideally suited to machine
learning. Unsupervised learning, for instance, might find
phenotypic clusters in real-world data that can be
explored further in clinical research [110].
Additionally, ML has the potential to advance the
established practice of secondary trial analysis by more
accurately identifying treatment heterogeneity while
still providing some (albeit incomplete) protection
against false-positive findings, thereby revealing more
promising areas for further research [111]. With proper
use of previous data, machine learning may also provide
risk predictions that can be evaluated prospectively. For
instance, a random forest model applied to COMPANION
trial data performed better at identifying individuals
who might benefit from cardiac resynchronization therapy
than a multiple logistic regression [112]. The results
demonstrated that random forests may capture feature
interactions that are often missed by simpler models.


Deriving real-world evidence from real-world data (i.e.,
drawing causal inferences) remains a highly desired and
extremely difficult objective, and ML shows considerable
promise in this area by increasing the precision with
which it can be done. The creation of predictive models
that can anticipate future events is an equally important
endeavor. A few of the methods suggested in the
literature include optimal discriminant analysis,
targeted maximum likelihood estimation, and propensity
score weighting made possible by ML [113][114].
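
As an illustration of propensity score weighting, the sketch below computes a normalised inverse-propensity-weighted difference in mean outcome. It assumes that the propensity scores have already been estimated, for example by an ML classifier fitted on patient covariates; the function name and the toy data are hypothetical.

```python
def ipw_average_treatment_effect(treated, outcomes, propensities):
    """Normalised inverse-propensity-weighted difference in mean outcome."""
    # Treated patients are weighted by 1/p, controls by 1/(1 - p).
    w_treated = [t / p for t, p in zip(treated, propensities)]
    w_control = [(1 - t) / (1 - p) for t, p in zip(treated, propensities)]
    mean_treated = sum(w * y for w, y in zip(w_treated, outcomes)) / sum(w_treated)
    mean_control = sum(w * y for w, y in zip(w_control, outcomes)) / sum(w_control)
    return mean_treated - mean_control


# Hypothetical data: treatment flag, outcome, and estimated propensity score.
treated = [1, 1, 0, 0]
outcomes = [3.0, 5.0, 2.0, 2.0]
propensities = [0.5, 0.5, 0.5, 0.5]
effect = ipw_average_treatment_effect(treated, outcomes, propensities)
print(effect)
```

With equal propensities the estimate reduces to the plain difference in group means; unequal propensities up-weight patients who were unlikely to receive the treatment they actually got.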


The use of ML to provide counterfactual policy estimates,
where existing data are used to anticipate outcomes under
circumstances that do not currently exist, is
particularly fascinating. For instance, reinforcement
learning can suggest better treatment plans based on
prior suboptimal treatments and their outcomes, and trees
of predictors may provide survival predictions for heart
failure patients under the conditions of receiving or not
receiving a heart transplant [115]. Risk-averse data
sharing agreements that restrict the amount of data
accessible for model training and a lack of
interoperability among EHR data systems are the key
obstacles to adoption [116]. In conclusion, there are
many efficient ML algorithms for managing, processing,
and analyzing data from clinical trials, but far fewer
methods for enhancing data quality from the start.
High-quality trials must be conducted in order to enable
more advanced ML processing, since the availability and
quality of data are the foundations of ML techniques.


2.3 Post Drug Development Sector and 
Pharmacovigilance 


Once the results of clinical studies have been compiled and
the treatment has been developed to achieve maximal
effectiveness and safety, it moves forward to the FDA for
comprehensive assessment. At this stage, the FDA examines
the drug application that the pharmaceutical company has
submitted and decides whether to approve it or not. Once
the pharmaceutical company has received approval, it can
start selling the drug and continue to manage its
products.


A completely different area of technology, processes, and
advanced improvements opens up once the drug reaches the
market and is ready for use. Figure 6 below shows some
practical illustrations of how companies can apply, and
have applied, AI and ML technologies in the post-drug
development and pharmacovigilance arena.
Fig 6. Machine Learning Applications in
Pharmacovigilance


• Predictive Analytics


AI aids in managing enormous amounts of data, while ML
aids in making predictions based on the analysis of those
data. The time it takes for new drugs to reach the market
has reportedly been cut in half with the help of AI and
ML. Typically, the lifecycle of a drug design lasts 10 to
15 years [117]. Artificial intelligence and machine
learning allow specialists to use statistical models that
learn from past and present data to anticipate future
outcomes, speeding up the process of discovering and
testing new treatments. SciBite makes the most of the
predictive analytics that AI and ML offer [118]. The
company reduced the amount of time it took for new
pharmaceuticals to reach the market by integrating AI
into its R&D methodology. According to New York
University, 80 percent of clinical data is unstructured
[119]. AI and ML are tools that can handle such a vast
body of information and thereby speed up operations in
the post-drug development sector.


• Social Listening for Accurate Health and
Drug-Related Information


Social media sites may provide a wealth of vital
information if the correct tools are available. An
article published following the Pacific Symposium on
Biocomputing demonstrates how AI can provide important
insights into the effectiveness of antidepressants by
analyzing five million posts [120]. The study also
emphasizes the value of social listening in identifying
drug safety issues and adverse drug reactions (ADRs). In
fact, the researchers were able to identify some new side
effects of prescription medications by using artificial
intelligence (AI) to examine public posts from users and
learn patterns from them.


• Smarter Individual Case Safety Report (ICSR)
Collection


Collecting ICSRs constitutes a significant problem, and
analysing them is an even bigger challenge. Over 20
million ICSRs are said to be stored in the WHO database,
which might prove to be an invaluable resource for
studying drug side effects and other potential dangers.
According to a study published in the journal Clinical
Pharmacology & Therapeutics [121], the ICSR collection
process is made smarter overall with the use of AI and
ML. The experts forecast that by 2030 ICSR reporting will
be significantly more advanced than it is at present.
Massive amounts of unstructured text in ICSRs can be
analyzed using AI-based technologies such as Natural
Language Processing (NLP), resulting in AI-enhanced ICSR
management.


• Cloud-Based Reporting 


AI-driven pharmacovigilance and cloud computing go hand 
in hand. Experts believe that cloud technology will be 
used to gather and analyze data. Integrating cloud-based 
computing with AI and ML is expected to increase the 
cost-efficiency, scalability, and simplicity of 
pharmacovigilance. 


• Personalized Medicine 


Individually customized treatments can be created by 
identifying a person's biological, physical, 
physiological, and genetic markers. Using AI in 
pharmacovigilance, healthcare practitioners can evaluate 
thousands of markers to produce considerably more precise 
predictions about how particular drugs will affect 
particular people [122]. Pharmacological therapy is 
expected to become more personalized, reducing ADRs and 
boosting drug effectiveness. 


• Nanomedicine and Drug Delivery 


Nanomedicine is now a reality, not just a concept from 
science fiction. Research published in the journal Drug 
Discovery Today [117] demonstrates how pioneers are 
combining nanotechnology and medicine to identify, treat, 
and monitor a variety of complex illnesses, including 
asthma, cancer, malaria, and HIV. Although nanomedicine 
is still in its infancy, advances have been made in drug 
delivery through nanoparticle modification. According to 
a recently published study [123], engineers and 
scientists are striving to construct implantable 
nanorobots that will improve drug delivery. AI techniques 
such as fuzzy logic, integrations, and neural networks 
can simplify the overall process. 


III. CONCLUSION 


The pharmaceutical industry is experiencing difficulties 
with drug development projects due to rising 
development costs and diminishing chances of discovering 
new drug molecules. This has led a growing number of 
pharmaceutical corporations and research institutions to 
investigate ML and robotics techniques to hasten the 
development of novel therapies and to ease the exchange 
of observational data and clinical trial outcomes. ML 
algorithms can be applied at multiple points in a drug's 
life cycle. This has been demonstrated in detail in the 
preceding sections, where we discussed ML applications 
beginning with the drug discovery phases, such as target 
prediction and validation, discovery of the therapeutic 
and toxicity profiles of drugs, and prediction of 
structure, bioactivity, and mode of action. Post-market 
drug monitoring reveals more data on high-risk 
populations, long-term effects, food and drug 
interactions, and the escalation of known and unknown 
adverse effects of the drug over time. The use of ML in 
post-market drug monitoring improves compliance and 
significantly reduces the expenditure for each ICSR. 
AI-based pharmacovigilance can assess the quality of 
reported events and also classify how harmful they are. 


Although ML, augmented intelligence, and a variety of 
medical data from around the world are paving the road 
for unified global healthcare, some challenges in 
applying Machine Learning algorithms across the drug 
development lifecycle persist. Constructing and training 
ML models requires high-quality, precise, and 
painstakingly vetted data. The required quantity and 
accuracy of the data depend on the intricacy of the data 
type and of the problem to be solved. As a result, 
producing large data sets can be costly. It is also 
important to keep in mind that training adjusts numerous 
neural network parameters, and that some theoretical and 
practical frameworks for improving these models are not 
yet available. Another area where ML models fall short is 
the prediction of novel paradigms. Because ML relies on 
training data to produce usable models, these models can 
only make predictions within the framework defined by 
the training data. 
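
One simple way to make this limitation visible in practice is to flag query inputs that fall outside the range covered by the training data. The Python sketch below shows a bounding-box check of this kind, sometimes called an applicability-domain check. It is an illustrative technique chosen here, not a method described in this paper: the functions `fit_range` and `in_domain`, the choice of descriptors, and the numeric values are all hypothetical.

```python
def fit_range(training_rows):
    """Record the minimum and maximum of each feature seen in training."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]


def in_domain(row, ranges):
    """Return True if every feature of `row` lies inside the training range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(row, ranges))


# Hypothetical descriptors per compound: (molecular weight, logP).
training = [(180.2, 1.2), (250.3, 2.5), (310.4, 3.1)]
ranges = fit_range(training)

inside = in_domain((200.0, 2.0), ranges)   # within both training ranges
outside = in_domain((720.0, 6.5), ranges)  # far beyond both training ranges
```

A prediction for a compound flagged as out of range would be treated with extra caution, since the model has seen nothing comparable during training. A per-feature range is a coarse criterion, and a point can lie inside the box while still being far from every training example.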


AI technology could speed up drug research and reduce its 
cost. Although ML might not be a solution for all issues 
in drug discovery, it is unquestionably a useful tool when 
used appropriately with the right data. The power of AI 
technology will undoubtedly be used to complement human 
intelligence and enhance our capabilities, thereby 
transforming the way we approach drug development. 